HTML Tag Based Metrics for use in Web Page Type Classification
Abstract
Traditional machine learning classifications of HTML documents focus on features drawn from the terms in the documents, the link structure of groups of documents, or a combination of both. These techniques attempt to generate topical classifications of documents, in the hope of mirroring a human's classification of pages into subject areas and thus facilitating retrieval. This paper presents an alternative method that aims to generate a "type-wise" classification of HTML documents. The types explored in this paper are tables, indexes, tables of contents, and textual content pages. These page types are of particular significance to the classification of documents on statistical web sites, which is one goal of the GovStat Project (http://www.ils.unc.edu/govstat), but they are also significant for HTML document collections at large.
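The abstract does not list the specific tag metrics the paper uses, so the following is only a minimal sketch of what tag-based features for type-wise classification might look like. The function name `tag_metrics` and the four feature names are hypothetical, chosen for illustration; they are not taken from the paper. The sketch uses only Python's standard `html.parser` module.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts start tags and the amount of character data in an HTML document."""

    def __init__(self):
        super().__init__()
        self.tags = Counter()
        self.text_chars = 0

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names before calling this method.
        self.tags[tag] += 1

    def handle_data(self, data):
        # Ignore whitespace at the edges of each text run.
        self.text_chars += len(data.strip())


def tag_metrics(html):
    """Return hypothetical tag-based features for one page (illustrative only)."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    total = sum(parser.tags.values()) or 1  # avoid division by zero on empty input
    return {
        # Many table cells suggest a table page.
        "table_cell_ratio": (parser.tags["td"] + parser.tags["th"]) / total,
        # Many links or list items suggest an index or table of contents.
        "link_ratio": parser.tags["a"] / total,
        "list_item_ratio": parser.tags["li"] / total,
        # Much text per tag suggests a textual content page.
        "text_chars_per_tag": parser.text_chars / total,
    }


if __name__ == "__main__":
    page = "<table><tr><td>1</td><td>2</td></tr></table>"
    print(tag_metrics(page))
```

Feature vectors of this kind could then be passed to any standard classifier; which classifier and which metrics work best for the four page types is a question for the paper itself, not something this sketch settles.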
Similar resources
An Improved Optimized Web Page Classification using Firefly Algorithm with NB Classifier (WPCNB)
The web is a huge repository of information, which calls for accurate automated classifiers for Web pages to maintain Web directories and to increase search engines' performance. In the web page classification problem, each term in each HTML/XML tag of each Web page can be taken as a feature, an efficient method to select the best features to reduce the feature space of the Web page classification problem d...
A fast HTML web page change detection approach based on hashing and reducing the number of similarity computations
This paper describes a fast HTML Web page detection approach that saves computation time by limiting the similarity computations between two versions of a Web page to nodes having the same HTML tag type, and by hashing the web page in order to provide direct access to node information. This efficient approach is suitable as a client application and for implementing server applications that coul...
Enhanced Information Retrieval by Using HTML Tags
Whenever digital libraries or knowledge management systems are to be automatically filled with web pages from the internet, document classification of the web pages is one of the major challenges. We present an approach which uses HTML tags in order to improve the quality of the hypertext document classification. Our approach uses weighting of HTML tags for separating relevant information in hy...
Feature Weighting Improvement of Web Text Categorization Based on Particle Swarm Optimization Algorithm
It is usually true that some structures, such as the title, can express the main content of texts, and these structures may influence the effectiveness of text categorization. However, the most common feature weighting algorithm, term frequency-inverse document frequency (TF-IDF), does not consider the structural information of texts. To solve this problem, a new feature weighting al...
An Approach to Content Extraction from Scientific Articles using Case-Based Reasoning
In this paper, we present an efficient approach for content extraction of scientific papers from web pages. The approach uses an artificial intelligence method, Case-Based Reasoning (CBR), which relies on the idea that similar problems have similar solutions and hence reuses past experiences to solve new problems or tasks. The key task of content extraction is the classification of HTML tag seque...
Publication date: 2004